Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 28 23:02:05 2021"

As a former computer science major, everything so far has been pretty easy. I have had some problems with DataCamp, which doesn't always accept correct R scripts. I expect to learn to use R for the statistical work in my Thesis. Previously I have used SPSS, but it always seemed lacking on the programming side. At least for now, R feels more familiar as a language thanks to, again, my former computer science experience.

My Diary: https://mvannas.github.io/IODS-project/

My Repository: https://github.com/mvannas/IODS-project

FYI: SSH authentication is now required by GitHub. While installing the public key in GitHub was simple, I used the instructions at https://cyberhelp.sesync.org/faq/set-up-gitlab-ssh-key.html to configure RStudio correctly for the SSH key. Also, the first time you push a commit to GitHub and the push asks something, you need to answer 'yes' to install the host key.


Regression and model validation

test_data <- read.csv("data/work_data.csv")

str(test_data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...

The dataset was originally collected between 3.12.2014 and 10.1.2015. It is a combined survey with questions from ASSIST (Approaches and Study Skills Inventory for Students) and SATS (Survey of Attitudes Toward Statistics). The subset used in this exercise contains the participants' background of gender (binary) and age, and combined mean scores for the deep, strategic and surface approaches to learning. Our subset also contains the participants' total points (from ASSIST) and attitude towards statistics (from SATS). After removing cases with zero points, 166 cases remained, with 7 variables each.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
test_data %>% group_by(gender) %>% summarise_at(vars(age, points,deep,stra, surf, attitude), list(name = mean))
## # A tibble: 2 × 7
##   gender age_name points_name deep_name stra_name surf_name attitude_name
##   <chr>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>
## 1 F          24.9        22.3      3.66      3.20      2.83          29.9
## 2 M          26.8        23.5      3.72      2.96      2.70          34.4
p <- ggpairs(test_data, mapping = aes(col = gender), upper = list(continuous = wrap('cor', size = 3)), lower = list(combo = wrap("facethist", bins = 20)))
p

Here you can see the summary statistics and relationships between the variables. There are more female than male participants, but there are no immediately observable differences between male and female scores or points. The average age of the participants is 24.9 (females) and 26.8 (males), with some clearly older outliers in both genders.

The deep and surface approaches to learning seem to correlate negatively with points, while the strategic approach correlates positively. Attitude also has a strong positive correlation with the overall points score.
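The correlations behind the plot can also be printed numerically; a minimal sketch using base R's cor() on the numeric columns of the data:

```r
# Correlation matrix of the numeric variables (gender is character, so excluded)
num_vars <- test_data[, sapply(test_data, is.numeric)]
round(cor(num_vars), 2)
```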

model <- lm(points ~ attitude + stra + surf, data = test_data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = test_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Based on the previous correlation estimates, I selected attitude, the strategic approach score (stra) and the surface approach score (surf) for multiple regression analysis. The summary shows that attitude is statistically the most significant, although its estimated coefficient is the smallest (0.33952). It should be noted that this score has not been averaged to the 0-5 scale like the deep, stra and surf scores. Two options exist: you can either scale the explanatory variables to the same range, or calculate standardized coefficient estimates, for which ready-made R libraries exist.
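The first option, rescaling the explanatory variables before fitting, could be sketched like this (the coefficients then express the change in points per one standard deviation of each predictor, making them directly comparable):

```r
# Standardize only the explanatory variables; leave points on its original scale
scaled_data <- test_data
scaled_data[, c("attitude", "stra", "surf")] <-
  scale(test_data[, c("attitude", "stra", "surf")])

model_scaled <- lm(points ~ attitude + stra + surf, data = scaled_data)
summary(model_scaled)$coefficients
```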

library(QuantPsyc)
## Loading required package: boot
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
## 
##     norm
lm.beta(model)
##    attitude        stra        surf 
##  0.42039820  0.11170276 -0.05257769

The surf score has the smallest standardized coefficient estimate, so I remove it from the equation and recalculate.

model <- lm(points ~ attitude + stra, data = test_data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = test_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

After reducing the variables to just attitude and stra, the stra score still does not appear statistically significant (using p under 0.05 as the threshold). Removing the surf score did, however, increase the adjusted R-squared, which means this model is a better fit to the observed data.

model <- lm(points ~ attitude, data = test_data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = test_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Running the model with just attitude results in a lower R-squared. Thus I would conclude that the previous model, with stra and attitude as the explanatory variables, is a better fit.

The multiple R-squared measures how much of the variation in the target the explanatory variables in the model explain together. Higher is better, up to 1: a value of 0 means the explanatory variables explain none of the variation in the observed values, and 1 means they explain all of it. The adjusted R-squared takes into account how much an added explanatory variable should increase the value by chance alone, so it grows only if the added variable contributes something useful to the model.
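The definition can be checked by hand from the fitted model object: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```r
# R-squared computed manually from the residuals of the current model
rss <- sum(residuals(model)^2)                              # residual sum of squares
tss <- sum((test_data$points - mean(test_data$points))^2)   # total sum of squares
1 - rss / tss   # should match the Multiple R-squared reported by summary(model)
```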

Our linear model makes some assumptions that the data has to fulfill. Firstly, the model is linear, so the relationship between the explanatory variables and the target must be linear. It is also usually assumed that the errors are normally distributed. The validity of these assumptions can be explored by looking at the residual plots.

model <- lm(points ~ attitude + stra, data = test_data)
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))

The Residuals vs Fitted plot is used to check the model for constant variance of errors. In the top left panel we see our residuals plotted against the fitted values. The scatter appears even, without any obvious patterns, so we conclude that with our data the model has a fairly constant variance of errors.

The normal Q-Q plot is used to check that the errors are normally distributed; to pass this check, the points should show fairly little deviation from the line. Again, with our model and data the errors seem to be normally distributed.
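Beyond the visual Q-Q check, normality of the residuals can also be tested formally; a sketch using the Shapiro-Wilk test from base R (the result is not shown here, and the test is only a complement to the plot):

```r
# Shapiro-Wilk test on the residuals:
# a p-value above 0.05 gives no evidence against normality
shapiro.test(residuals(model))
```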

The Residuals vs Leverage plot shows the residuals together with Cook's distance. Cook's distance estimates the influence of an individual observation on the parameters of the model; values over 1 are usually taken to suggest undue influence. Our model shows that while a few observations have slightly more influence than the rest, their Cook's distances are still very low, so no observation has undue power over the model.
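The same check can be done numerically: cooks.distance() returns Cook's distance for every observation, so we can confirm that none exceed 1.

```r
# Cook's distance per observation for the fitted model
cooks <- cooks.distance(model)
max(cooks)         # largest single influence in the data
which(cooks > 1)   # observations with undue influence (expected: none)
```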


Chapter 3 is broken.


Clustering and Classification

Our next exercise uses a ready-made dataset, the Boston dataset from the MASS package. This dataset contains information on the city of Boston, with descriptive variables for the observed suburbs.

library(MASS)
library("ggplot2")                     
library("GGally")
library("corrplot")
## corrplot 0.90 loaded
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.5     ✓ purrr   0.3.4
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library("vtable")
## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

There are 506 observations with 14 different variables.

Variable Explanation
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitrogen oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted mean of distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat lower status of the population (percent)
medv median value of owner-occupied homes in $1000s

Information on variables from https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

ggpairs(Boston, upper = list(continuous = wrap('cor', size = 2)))

ggcorr(cor(Boston), geom = "circle", nbreaks = 11)

sumtable(Boston)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
crim 506 3.614 8.602 0.006 0.082 3.677 88.976
zn 506 11.364 23.322 0 0 12.5 100
indus 506 11.137 6.86 0.46 5.19 18.1 27.74
chas 506 0.069 0.254 0 0 0 1
nox 506 0.555 0.116 0.385 0.449 0.624 0.871
rm 506 6.285 0.703 3.561 5.886 6.624 8.78
age 506 68.575 28.149 2.9 45.025 94.075 100
dis 506 3.795 2.106 1.13 2.1 5.188 12.126
rad 506 9.549 8.707 1 4 24 24
tax 506 408.237 168.537 187 279 666 711
ptratio 506 18.456 2.165 12.6 17.4 20.2 22
black 506 356.674 91.295 0.32 375.378 396.225 396.9
lstat 506 12.653 7.141 1.73 6.95 16.955 37.97
medv 506 22.533 9.197 5 17.025 25 50

From the summary statistics we can observe that the variables cover a wide range of values: chas is a binary category with values 0 or 1, while black (population proportion) ranges from roughly 0 to 400. With the exception of chas, most of the variables seem to have fairly strong correlations with at least some of the others.

bsca <- scale(Boston)
bsca <- as.data.frame(bsca)
sumtable(bsca, add.median = TRUE)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 50 Pctl. 75 Max
crim 506 0 1 -0.419 -0.411 -0.39 0.007 9.924
zn 506 0 1 -0.487 -0.487 -0.487 0.049 3.8
indus 506 0 1 -1.556 -0.867 -0.211 1.015 2.42
chas 506 0 1 -0.272 -0.272 -0.272 -0.272 3.665
nox 506 0 1 -1.464 -0.912 -0.144 0.598 2.73
rm 506 0 1 -3.876 -0.568 -0.108 0.482 3.552
age 506 0 1 -2.333 -0.837 0.317 0.906 1.116
dis 506 0 1 -1.266 -0.805 -0.279 0.662 3.957
rad 506 0 1 -0.982 -0.637 -0.522 1.66 1.66
tax 506 0 1 -1.313 -0.767 -0.464 1.529 1.796
ptratio 506 0 1 -2.705 -0.488 0.275 0.806 1.637
black 506 0 1 -3.903 0.205 0.381 0.433 0.441
lstat 506 0 1 -1.53 -0.799 -0.181 0.602 3.545
medv 506 0 1 -1.906 -0.599 -0.145 0.268 2.987

After using the R scale function we have now standardized the values. The standardized summary table looks fairly even, although the crim variable still has a notably large spread: its maximum standardized value (9.9) is more than twice as large as in any other column.
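That the scaling worked can be confirmed directly: after scale(), every column should have a mean of (numerically) zero and a standard deviation of exactly one.

```r
# Verify the standardization of the scaled Boston data frame
round(colMeans(bsca), 10)   # all means should print as 0
sapply(bsca, sd)            # all standard deviations should be 1
```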

# Create quartiles from the crim value as separate categories
crime <- cut(bsca$crim, breaks = quantile(bsca$crim), label =c("low","med_low", "med_high", "high"), include.lowest = TRUE)
bsca <- dplyr::select(bsca, -crim)
bsca <- data.frame(bsca, crime)

# Create test and training sets (80 % to train set)
ind <- sample(nrow(bsca),  size = nrow(bsca) * 0.8)
train <- bsca[ind,]
test <- bsca[-ind,]

# Copy the correct crime categories from the test set and remove them from it
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
# Shamelessly copied from the DataCamp exercise
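Note that sample() makes the split random, so the numbers in this report change every time it is knitted. Setting a seed before the split would make the results reproducible; a minimal sketch of the seeded version (the seed value 123 is arbitrary):

```r
set.seed(123)   # arbitrary seed, for a reproducible train/test split
ind <- sample(nrow(bsca), size = floor(nrow(bsca) * 0.8))
length(ind)     # number of training rows, 80 % of the data
```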

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2351485 0.2549505 0.2574257 0.2524752 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.92296626 -0.9009418 -0.10655643 -0.8644534  0.41735289 -0.8509330
## med_low  -0.07949111 -0.2759888 -0.04298342 -0.5709646 -0.13455023 -0.3018332
## med_high -0.39571399  0.1480147  0.18195173  0.3988551  0.08386261  0.3806720
## high     -0.48724019  1.0171096 -0.07933396  1.0547891 -0.40860826  0.8086769
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8853003 -0.6796430 -0.7280939 -0.47053909  0.37312954 -0.75325821
## med_low   0.4058627 -0.5380946 -0.4748703 -0.05838825  0.31399003 -0.09882069
## med_high -0.3713910 -0.3877604 -0.3008162 -0.22018450  0.05211565  0.03921177
## high     -0.8683753  1.6382099  1.5141140  0.78087177 -0.92470369  0.88757270
##                 medv
## low       0.49224121
## med_low  -0.02489075
## med_high  0.13872274
## high     -0.67428113
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.08483102  0.735059303 -0.946528390
## indus   -0.03057688 -0.175694994  0.516979849
## chas    -0.10065380 -0.055712537 -0.005489093
## nox      0.40795329 -0.889560908 -1.334254603
## rm      -0.10133878 -0.136008952 -0.130604301
## age      0.25934018 -0.209862908  0.082722146
## dis     -0.09972813 -0.285641206  0.416202167
## rad      2.95005773  0.953348706  0.041864765
## tax      0.03553384  0.006594074  0.390011020
## ptratio  0.14302224 -0.109188558 -0.321861468
## black   -0.14229325  0.023673885  0.150810414
## lstat    0.20030651 -0.249856741  0.366853588
## medv     0.18070509 -0.442741021 -0.130299854
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9473 0.0388 0.0138
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the biplot, and also quite easily from the coefficient list in the linear discriminants output, we can see that the rad variable, i.e. the index of accessibility to radial highways, has a very strong association with high crime rates.

For the three lower crime categories, areas with larger plot sizes seem to have lower crime rates. The nitrogen oxide concentration seems to have a similar but reversed effect: the higher the concentration, the more crime.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       20      11        1    0
##   med_low    4      17        2    0
##   med_high   1       6       15    0
##   high       0       0        0   25

The generated model predicts the high-crime areas best. This is most likely explained by the high category forming a clearly separate cluster in the biplot, whereas the three other categories overlap more, so the predictions for them show more variance.
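The cross-tabulation can be condensed into accuracy figures; a sketch comparing the predicted classes against the held-out labels:

```r
# Overall share of correctly classified test observations
mean(correct_classes == lda.pred$class)

# Per-class accuracy: the diagonal of the row-normalized confusion matrix
diag(prop.table(table(correct_classes, lda.pred$class), margin = 1))
```

With the table shown above, the overall accuracy works out to (20 + 17 + 15 + 25) / 102 ≈ 0.75, though the exact numbers vary because the train/test split is random.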

# Redo the scaling, since crim was replaced by the crime categories above
bsca <- scale(Boston)
dist_euclidean <- dist(bsca)
summary(dist_euclidean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
kc <- 6
km <- kmeans(bsca, centers = 4)
# TWCSS for k = 1..kc, computed on the scaled data (not the raw Boston data)
TWCSS <- sapply(1:kc, function(k){kmeans(bsca, k)$tot.withinss})
bsca <- as.data.frame(bsca)
pairs(bsca[1:5], col = km$cluster)

pairs(bsca[6:10], col = km$cluster)

pairs(bsca[11:14], col = km$cluster)

qplot(x = 1:kc, y = TWCSS, geom = 'line')

Looking at the paired plots, or the more informative total within-cluster sum of squares (TWCSS) plot, we can see that clustering into two groups would capture most of the differences in the variables.
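Based on that elbow, the clustering could be rerun with two centers; a self-contained sketch (the seed 123 is arbitrary, used because k-means starts from random centers):

```r
library(MASS)   # for the Boston dataset

set.seed(123)   # arbitrary seed; k-means initialization is random
km2 <- kmeans(scale(Boston), centers = 2)

table(km2$cluster)   # sizes of the two clusters
pairs(as.data.frame(scale(Boston))[1:5], col = km2$cluster)
```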